Setup and Context¶

Introduction¶

Welcome to Boston, Massachusetts, in the 1970s! Imagine you're working for a real estate development company. Your company wants to value any residential project before committing to it. You are tasked with building a model that can provide a price estimate based on a home's characteristics, like:

  • The number of rooms
  • The distance to employment centres
  • How rich or poor the area is
  • The number of students per teacher in local schools, etc.

To accomplish your task you will:

  1. Analyse and explore the Boston house price data
  2. Split your data for training and testing
  3. Run a Multivariable Regression
  4. Evaluate your model's coefficients and residuals
  5. Use data transformation to improve your model's performance
  6. Use your model to estimate a property price

Upgrade plotly (only Google Colab Notebook)¶

Google Colab may not be running the latest version of plotly. If you're working in Google Colab, uncomment the line below, run the cell, and restart your notebook server.

In [1]:
# %pip install --upgrade plotly

Import Statements¶

In [2]:
import pandas as pd
import numpy as np

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression

Notebook Presentation¶

In [3]:
pd.options.display.float_format = '{:,.2f}'.format

Load the Data¶

The first column in the .csv file just has the row numbers, so it will be used as the index.

In [4]:
data = pd.read_csv('boston.csv', index_col=0)

Understand the Boston House Price Dataset¶


Characteristics:

:Number of Instances: 506 

:Number of Attributes: 13 numeric/categorical predictive. The Median Value (attribute 14) is the target.

:Attribute Information (in order):
    1. CRIM     per capita crime rate by town
    2. ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
    3. INDUS    proportion of non-retail business acres per town
    4. CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
    5. NOX      nitric oxides concentration (parts per 10 million)
    6. RM       average number of rooms per dwelling
    7. AGE      proportion of owner-occupied units built prior to 1940
    8. DIS      weighted distances to five Boston employment centres
    9. RAD      index of accessibility to radial highways
    10. TAX      full-value property-tax rate per $10,000
    11. PTRATIO  pupil-teacher ratio by town
    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    13. LSTAT    % lower status of the population
    14. PRICE     Median value of owner-occupied homes in $1000's

:Missing Attribute Values: None

:Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset. This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. You can find the original research paper here.

Preliminary Data Exploration 🔎¶

Challenge

  • What is the shape of data?
  • How many rows and columns does it have?
  • What are the column names?
  • Are there any NaN values or duplicates?
In [5]:
data
Out[5]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
0 0.01 18.00 2.31 0.00 0.54 6.58 65.20 4.09 1.00 296.00 15.30 396.90 4.98 24.00
1 0.03 0.00 7.07 0.00 0.47 6.42 78.90 4.97 2.00 242.00 17.80 396.90 9.14 21.60
2 0.03 0.00 7.07 0.00 0.47 7.18 61.10 4.97 2.00 242.00 17.80 392.83 4.03 34.70
3 0.03 0.00 2.18 0.00 0.46 7.00 45.80 6.06 3.00 222.00 18.70 394.63 2.94 33.40
4 0.07 0.00 2.18 0.00 0.46 7.15 54.20 6.06 3.00 222.00 18.70 396.90 5.33 36.20
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
501 0.06 0.00 11.93 0.00 0.57 6.59 69.10 2.48 1.00 273.00 21.00 391.99 9.67 22.40
502 0.05 0.00 11.93 0.00 0.57 6.12 76.70 2.29 1.00 273.00 21.00 396.90 9.08 20.60
503 0.06 0.00 11.93 0.00 0.57 6.98 91.00 2.17 1.00 273.00 21.00 396.90 5.64 23.90
504 0.11 0.00 11.93 0.00 0.57 6.79 89.30 2.39 1.00 273.00 21.00 393.45 6.48 22.00
505 0.05 0.00 11.93 0.00 0.57 6.03 80.80 2.50 1.00 273.00 21.00 396.90 7.88 11.90

506 rows × 14 columns

In [6]:
data.dtypes
Out[6]:
CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD        float64
TAX        float64
PTRATIO    float64
B          float64
LSTAT      float64
PRICE      float64
dtype: object
In [7]:
display(data.shape)
(506, 14)
In [8]:
print(f"The data has {data.shape[0]} rows and {data.shape[1]} columns.")
The data has 506 rows and 14 columns.

To get a list of the column names:

In [9]:
data.columns
Out[9]:
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'PRICE'],
      dtype='object')

Data Cleaning - Check for Missing Values and Duplicates¶

In [10]:
data.isna().any().any()
Out[10]:
False
In [11]:
data.duplicated().any()
Out[11]:
False

--> No NaN and no duplicated values.

Descriptive Statistics¶

Challenge

  • How many students are there per teacher on average?
  • What is the average price of a home in the dataset?
  • What is the CHAS feature?
  • What are the minimum and maximum values of CHAS, and why?
  • What are the maximum and minimum numbers of rooms per dwelling in the dataset?
In [12]:
data_stats = data.describe()
data_stats
Out[12]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
count 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00 506.00
mean 3.61 11.36 11.14 0.07 0.55 6.28 68.57 3.80 9.55 408.24 18.46 356.67 12.65 22.53
std 8.60 23.32 6.86 0.25 0.12 0.70 28.15 2.11 8.71 168.54 2.16 91.29 7.14 9.20
min 0.01 0.00 0.46 0.00 0.39 3.56 2.90 1.13 1.00 187.00 12.60 0.32 1.73 5.00
25% 0.08 0.00 5.19 0.00 0.45 5.89 45.02 2.10 4.00 279.00 17.40 375.38 6.95 17.02
50% 0.26 0.00 9.69 0.00 0.54 6.21 77.50 3.21 5.00 330.00 19.05 391.44 11.36 21.20
75% 3.68 12.50 18.10 0.00 0.62 6.62 94.07 5.19 24.00 666.00 20.20 396.23 16.96 25.00
max 88.98 100.00 27.74 1.00 0.87 8.78 100.00 12.13 24.00 711.00 22.00 396.90 37.97 50.00
In [13]:
print(f"There are {data_stats.loc['mean'].PTRATIO:.0f} students "\
      "per teacher on average.")
There are 18 students per teacher on average.
In [14]:
print("The average price of a home in the dataset is "\
      f"${data_stats.loc['mean'].PRICE * 1000:,.0f}.")
The average price of a home in the dataset is $22,533.

We've experienced a lot of inflation and house price appreciation since then!

CHAS feature: "CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)"

"Tract housing is a type of housing development in which multiple similar houses are built on a tract (area) of land that is subdivided into smaller lots."

So basically CHAS has value 1 if the home land area borders the river.

As such, it only has the value 0 or 1. This kind of feature is also known as a dummy variable.
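As a quick illustration (made-up data, not from the Boston set), pandas can create this kind of 0/1 dummy encoding from a categorical column with pd.get_dummies():

```python
import pandas as pd

# Hypothetical mini-dataset: a yes/no 'riverside' flag
toy = pd.DataFrame({'riverside': ['yes', 'no', 'no', 'yes']})
dummies = pd.get_dummies(toy['riverside'], dtype=int)

# The 'yes' column is 1 where the home borders the river, 0 otherwise
print(dummies['yes'].tolist())  # [1, 0, 0, 1]
```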

In [15]:
print("In the dataset, the maximum avg number of rooms "\
      f"per dwelling is {data_stats.loc['max'].RM:.2f} "\
      f"and the minimum is {data_stats.loc['min'].RM:.2f}.")
In the dataset, the maximum avg number of rooms per dwelling is 8.78 and the minimum is 3.56.

Visualise the Features¶

Challenge: Having looked at some descriptive statistics, visualise the data for your model. Use Seaborn's .displot() to create a histogram and superimpose the Kernel Density Estimate (KDE) for the following variables:

  • PRICE: the home price in $ thousands.
  • RM: the average number of rooms per dwelling.
  • DIS: the weighted distance to the 5 Boston employment centres, i.e., the estimated length of the commute.
  • RAD: the index of accessibility to radial highways.

Try setting the aspect parameter to 2 for a better picture.

What do you notice in the distributions of the data?

The size of the graph must be managed with the height= and aspect= parameters, as seen previously.

height: Height (in inches) of each facet.

aspect: Aspect ratio of each facet, so that aspect * height gives the width of each facet in inches.

House Prices 💰¶

In [16]:
plt.figure(dpi=200)
with sns.axes_style('darkgrid'):
    dis = sns.displot(data=data, x='PRICE',
                      height=4, aspect=2,
                      kde=True, color='darkorange',
                      bins=50)
    dis.set(xlabel='Price in $ thousands')
plt.title(f'1970s Home Values in Boston. '\
          f'Average: ${(1000*data.PRICE.mean()):,.0f}')
plt.show()
<Figure size 1280x960 with 0 Axes>

Remark: We can note that there is a spike in the number of homes at the very right tail at the $50,000 mark.

Distance to Employment - Length of Commute 🚗¶

In [17]:
plt.figure(dpi=200)
with sns.axes_style('darkgrid'):
    dis = sns.displot(data=data, x='DIS',
                      height=4, aspect=2,
                      kde=True, color='green',
                      bins=25)
    dis.set(xlabel='Weighted Distance to 5 Boston Employment Centres')
plt.title(
  f'Distance to Employment Centres. Average: {(data.DIS.mean()):.2}')
plt.show()
<Figure size 1280x960 with 0 Axes>

Remark: Most homes are at a weighted distance of about 3.8 from the employment centres. There are fewer and fewer homes the further out we go.

Number of Rooms¶

In [18]:
plt.figure(dpi=200)
with sns.axes_style('darkgrid'):
    dis = sns.displot(data=data, x='RM',
                      height=4, aspect=2,
                      kde=True, color='firebrick'
                     )
    dis.set(xlabel='Average Number of Rooms per Dwelling')
plt.title(
  f'Distribution of Rooms in Boston. Average: {data.RM.mean():.2}')
plt.show()
<Figure size 1280x960 with 0 Axes>

Access to Highways 🛣¶

Note: We can change the width of the bins with binwidth=0.5 for example.

In [19]:
plt.figure(dpi=200)
with sns.axes_style('darkgrid'):
    dis = sns.displot(data=data, x='RAD',
                      height=4, aspect=2,
                      kde=True, color='darkblue',
                      bins=25,
                      ec='black') # ec colors the border of the bin
    dis.set(xlabel='Index of Accessibility to Radial Highways')
plt.show()
<Figure size 1280x960 with 0 Axes>

Remark: RAD is an index of accessibility to roads. Better access to a highway is represented by a higher number. There's a big gap in the values of the index.

Next to the River? ⛵️¶

Challenge

Create a bar chart with plotly for CHAS to show how many more homes are away from the river versus next to it.

You can make your life easier by providing a list of values for the x-axis (e.g., x=['No', 'Yes'])

Regarding the display of NumPy arrays:

By default, arrays with more than 1,000 elements are summarized/truncated. We can change the display options with np.set_printoptions().

threshold is the number of elements above which an array is summarized; edgeitems is the number of items shown at the beginning and end of the summary.

We can also set the display options temporarily with np.printoptions():

In [20]:
yes_no_river = np.where(data.CHAS == 1, 'Yes', 'No')
len(yes_no_river)
Out[20]:
506
In [21]:
# np.set_printoptions(threshold=500, edgeitems=10)
with np.printoptions(threshold=500, edgeitems=10):
    display(yes_no_river)
array(['No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', ...,
       'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No', 'No'],
      dtype='<U3')
In [22]:
fig = px.histogram(x=yes_no_river, histfunc='count',
                   color=yes_no_river,
                   labels={'x': 'Property Located Next to the River?'},
                   title='Next to Charles River?',
                   # color_discrete_sequence=px.colors.sequential.haline
                  )
fig.update_layout(showlegend=False, yaxis_title='Count')
fig.show()

Course Solution:

In [23]:
river_access= data.CHAS.value_counts()
river_access
Out[23]:
0.00    471
1.00     35
Name: CHAS, dtype: int64
In [24]:
fig = px.bar(x=['No', 'Yes'], y=river_access.values,
             color=river_access.values,
             labels={'x': 'Property Located Next to the River?',
                   'y': 'Count'},
             title='Next to Charles River?',
             color_continuous_scale='haline')
fig.update_layout(coloraxis_showscale=False)
fig.show()

We see that out of the total number of 506 homes, only 35 are located next to the Charles River.

Understand the Relationships in the Data¶

Run a Pair Plot¶

Challenge

There might be some relationships in the data that we should know about. Before you run the code, make some predictions:

  • What would you expect the relationship to be between pollution (NOX) and the distance to employment (DIS)?
  • What kind of relationship do you expect between the number of rooms (RM) and the home value (PRICE)?
  • What about the amount of poverty in an area (LSTAT) and home prices?

Run a Seaborn .pairplot() to visualise all the relationships at the same time. Note, this is a big task and can take 1-2 minutes! After it's finished check your intuition regarding the questions above on the pairplot.

In [25]:
data[['NOX', 'DIS', 'RM', 'PRICE', 'LSTAT']]
Out[25]:
NOX DIS RM PRICE LSTAT
0 0.54 4.09 6.58 24.00 4.98
1 0.47 4.97 6.42 21.60 9.14
2 0.47 4.97 7.18 34.70 4.03
3 0.46 6.06 7.00 33.40 2.94
4 0.46 6.06 7.15 36.20 5.33
... ... ... ... ... ...
501 0.57 2.48 6.59 22.40 9.67
502 0.57 2.29 6.12 20.60 9.08
503 0.57 2.17 6.98 23.90 5.64
504 0.57 2.39 6.79 22.00 6.48
505 0.57 2.50 6.03 11.90 7.88

506 rows × 5 columns

In [26]:
sns.pairplot(data[['NOX', 'DIS', 'RM', 'PRICE', 'LSTAT']],
             kind='reg', # to include a regression line
             plot_kws={'line_kws':{'color': 'cyan'},
                       'lowess': True})
plt.show()

We can also do the pairplot on the whole dataset, but it makes a huge plot that is very difficult to read: sns.pairplot(data)

We can clearly see some trends:

  • the further away from the employment centres, the less pollution
  • the more rooms, the higher the price
  • the poorer the area, the lower the price
  • the poorer the area, the fewer rooms in homes on average

Challenge

Use Seaborn's .jointplot() to look at some of the relationships in more detail. Create a jointplot for:

  • DIS and NOX
  • INDUS vs NOX
  • LSTAT vs RM
  • LSTAT vs PRICE
  • RM vs PRICE

Try adding some opacity or alpha to the scatter plots using keyword arguments under joint_kws.

Distance from Employment vs. Pollution¶

Challenge:

Compare DIS (Distance from employment) with NOX (Nitric Oxide Pollution) using Seaborn's .jointplot(). Does pollution go up or down as the distance increases?

In [27]:
g = sns.jointplot(data[['DIS', 'NOX']],
                  x='DIS', y='NOX',
                  kind='reg', ylim=(0.35, 0.9),
                  joint_kws={'scatter_kws': {'edgecolor': 'white',
                                             'alpha': 0.5},
                             'line_kws': {'color': 'orange'},
                             'lowess': True})
plt.show()

We see that pollution goes down as we go further and further out of town. This makes intuitive sense. However, even at the same distance of 2 miles to employment centres, we can get very different levels of pollution. By the same token, DIS of 9 miles and 12 miles have very similar levels of pollution.

Proportion of Non-Retail Industry 🏭🏭🏭 versus Pollution¶

Challenge:

Compare INDUS (the proportion of non-retail industry i.e., factories) with NOX (Nitric Oxide Pollution) using Seaborn's .jointplot(). Does pollution go up or down as there is a higher proportion of industry?

In [28]:
g = sns.jointplot(data[['INDUS', 'NOX']],
                  x='INDUS', y='NOX',
                  kind='reg', color='indigo',
                  joint_kws={'scatter_kws': {'edgecolor': 'white',
                                             'alpha': 0.3},
                             'line_kws': {'color': 'darkred'}
                            },
                 )
plt.show()

% of Lower Income Population vs Average Number of Rooms¶

Challenge

Compare LSTAT (proportion of lower-income population) with RM (number of rooms) using Seaborn's .jointplot(). How does the number of rooms per dwelling vary with the poverty of area? Do homes have more or fewer rooms when LSTAT is low?

In [29]:
g = sns.jointplot(data[['LSTAT', 'RM']],
                  x='LSTAT', y='RM',
                  kind='reg', color='orangered',
                  joint_kws={'scatter_kws': {'edgecolor': 'white',
                                             'alpha': 0.7},
                             'line_kws': {'color': 'dimgrey'},
                            }
                 )
plt.show()

In the top left corner we see that, for all the homes with 8 or more rooms, LSTAT is well below 10%.

% of Lower Income Population versus Home Price¶

Challenge

Compare LSTAT with PRICE using Seaborn's .jointplot(). How does the proportion of the lower-income population in an area affect home prices?

In [30]:
g = sns.jointplot(data[['LSTAT', 'PRICE']],
                  x='LSTAT', y='PRICE',
                  kind='scatter', color='darkgreen',
                  joint_kws={'edgecolor': 'white',
                             'alpha': 0.7},
                 )
plt.show()

Number of Rooms versus Home Value¶

Challenge

Compare RM (number of rooms) with PRICE using Seaborn's .jointplot(). You can probably guess how the number of rooms affects home prices. 😊

In [31]:
g = sns.jointplot(data, x='RM', y='PRICE',
                  kind='reg', ylim=(3, 53),
                  color='maroon',
                  joint_kws={'scatter_kws': {'edgecolor': 'white',
                                             'alpha': 0.8},
                             'line_kws': {'color': 'darkgoldenrod'},
                             'lowess': True
                  }
                 )
plt.show()

Again, we see those homes at the $50,000 mark all lined up at the top of the chart. Perhaps there was some sort of cap or maximum value imposed during data collection.

Split Training & Test Dataset¶

We can't use all 506 entries in our dataset to train our model. The reason is that we want to evaluate our model on data that it hasn't seen yet (i.e., out-of-sample data). That way we can get a better idea of its performance in the real world.

In other words, we don't want to train our model on all 506 entries and then send it straight out into the world. Instead, we train the model on, say, 80% of the data and then test it on the remaining 20%. That way we can evaluate the model on real data it hasn't seen yet.

Challenge

  • Import the train_test_split() function from sklearn --> good documentation!
  • Create 4 subsets: X_train, X_test, y_train, y_test
  • Split the training and testing data roughly 80/20.
  • To get the same random split every time you run your notebook use random_state=10. This helps us get the same results every time and avoid confusion while we're learning ("random_state: Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls.")

Hint: Remember, your target is your home PRICE, and your features are all the other columns you'll use to predict the price.

In [32]:
from sklearn.model_selection import train_test_split
In [33]:
X = data.drop('PRICE', axis='columns')
y = data.PRICE
In [34]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=10)

Multivariable Regression¶

In a previous lesson, we had a linear model with only a single feature (our movie budgets):

$$ \widehat{REVENUE} = \theta_0 + \theta_1 BUDGET $$

And we did the following:

regression = LinearRegression()

# Explanatory Variable(s) or Feature(s)
X = new_films.USD_Production_Budget.to_frame()
# Response Variable or Target
y = new_films.USD_Worldwide_Gross.to_frame()

regression.fit(X, y)

# Theta zero
regression.intercept_
# Theta one
regression.coef_

# R-squared -> how well our model fits our data
# Anything over 50% is a very decent model
regression.score(X, y)

# Running the model
budget = 350000000
revenue_estimate = regression.intercept_[0] + regression.coef_[0,0]*budget

This time we have a total of 13 features. Therefore, our Linear Regression model will have the following form:

$$ \widehat{PRICE} = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \ldots + \theta_{13} LSTAT $$

So we should have 13 coefficients ($\theta_1$ to $\theta_{13}$) and one intercept ($\theta_0$).
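We can sanity-check that expectation on synthetic data (random numbers, purely illustrative): sklearn stores one coefficient per feature in coef_ and the intercept separately in intercept_.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 13))  # 13 features, like our dataset
y_demo = rng.normal(size=100)        # a 1-D target

model = LinearRegression().fit(X_demo, y_demo)
print(model.coef_.shape)  # (13,) -> one theta per feature
print(model.intercept_)   # a single number -> theta zero
```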

Run Your First Regression¶

Challenge

Use sklearn to run the regression on the training dataset. How high is the r-squared for the regression on the training data?

In [35]:
regression = LinearRegression()
regression.fit(X_train, y_train)
Out[35]:
LinearRegression()
In [36]:
regression.intercept_
Out[36]:
36.53305138282448
In [37]:
regression.coef_
Out[37]:
array([-1.28180656e-01,  6.31981786e-02, -7.57627602e-03,  1.97451452e+00,
       -1.62719890e+01,  3.10845625e+00,  1.62922153e-02, -1.48301360e+00,
        3.03988206e-01, -1.20820710e-02, -8.20305699e-01,  1.14189890e-02,
       -5.81626431e-01])
In [38]:
rsquare = regression.score(X_train, y_train)
rsquare
Out[38]:
0.750121534530608

Wow, R-squared at 75%! That is amazing!

In [39]:
regression.score(X_test, y_test)
Out[39]:
0.6709339839115642

And 67% on the test data, this is still very good!

Evaluate the Coefficients of the Model¶

Here we do a sense check on our regression coefficients. The first thing to look for is if the coefficients have the expected sign (positive or negative).

Challenge Print out the coefficients (the thetas in the equation above) for the features. Hint: You'll see a nice table if you stick the coefficients in a DataFrame.

  • We already saw that RM on its own had a positive relation to PRICE based on the scatter plot. Is RM's coefficient also positive?
  • What is the sign on the LSTAT coefficient? Does it match your intuition and the scatter plot above?
  • Check the other coefficients. Do they have the expected sign?
  • Based on the coefficients, how much more expensive is a home with 6 rooms compared to a home with 5 rooms? According to the model, what is the premium you would have to pay for an extra room?
In [40]:
data.iloc[:, :13].columns
Out[40]:
Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'],
      dtype='object')
In [41]:
coeffs = pd.DataFrame(regression.coef_,
                      index=data.iloc[:, :13].columns,
                      columns=['Coef']).T
coeffs
Out[41]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
Coef -0.13 0.06 -0.01 1.97 -16.27 3.11 0.02 -1.48 0.30 -0.01 -0.82 0.01 -0.58

According to our previous relationships:

  • PRICE goes down as poverty (LSTAT) goes up, and we have a negative coefficient, so good 👍
  • PRICE goes up as the number of rooms (RM) goes up, and we have a strong positive coefficient ✔️
  • PRICE goes down as pollution (NOX) goes up, and we have a strong negative coefficient 👌
In [42]:
print(f"According to the model, you'd have to "\
      f"pay about ${coeffs.loc['Coef', 'RM'] * 1000:.0f} more for an extra room.")
According to the model, you'd have to pay about $3108 more for an extra room.

Analyse the Estimated Values & Regression Residuals¶

The next step is to evaluate our regression. How good our regression is depends not only on the r-squared. It also depends on the residuals - the difference between the model's predictions ($\hat y_i$) and the true values ($y_i$) inside y_train.

In [43]:
predicted_values = regression.predict(X_train)
print(f"Length of the predicted values array: {len(predicted_values)}\n")
with np.printoptions(threshold=100, edgeitems=5):
    display(predicted_values)
Length of the predicted values array: 404

array([21.02958601, 12.21844467, 13.74785342, 20.7351517 , 23.41262356,
       ..., 24.84599397, 19.89193201, 19.16746008, 22.40634226,
       28.57061195])
In [44]:
residuals = (y_train - predicted_values)
print(f"Length of the residuals array: {len(residuals)}\n")
residuals
Length of the residuals array: 404

Out[44]:
50    -1.33
367   10.88
34    -0.25
78     0.46
172   -0.31
       ... 
320   -1.05
15     0.01
484    1.43
125   -1.01
265   -5.77
Name: PRICE, Length: 404, dtype: float64

Challenge: Create two scatter plots.

The first plot should be the actual values (y_train) against the predicted values.

In [45]:
plt.figure(figsize=(6,4), dpi=120)
with sns.axes_style('ticks'):
    g = sns.scatterplot(x=y_train, y=predicted_values,
                        c='indigo', alpha=0.6)
    sns.lineplot(x=y_train, y=y_train, c='cyan')
    g.set(title=r'Actual vs Predicted Prices: $y_i$ vs $\hat y_i$',
          xlabel=r'Actual prices in \$ thousands $y_i$',
          ylabel=r'Predicted prices in \$ thousands $\hat y_i$')
plt.show()

The cyan line in the middle shows y_train against y_train. If the predictions had been 100% accurate then all the dots would be on this line. The further away the dots are from the line, the worse the prediction was. That makes the distance to the cyan line, you guessed it, our residuals 😊

The second plot should be the residuals against the predicted prices.

In [46]:
plt.figure(figsize=(6,4), dpi=120)
with sns.axes_style('ticks'):
    g = sns.scatterplot(x=predicted_values, y=residuals,
                        c='indigo', alpha=0.6)
    g.set(title='Residuals vs Predicted Values',
          xlabel=r'Predicted prices in \$ thousands $\hat y_i$',
          ylabel='Residuals')
plt.show()

Why do we want to look at the residuals? We want to check that they look random. Why? The residuals represent the errors of our model. If there's a pattern in our errors, then our model has a systematic bias.

We can analyse the distribution of the residuals. In particular, we're interested in the skew and the mean.

In an ideal case, what we want is something close to a normal distribution. A normal distribution has a skewness of 0 and a mean of 0. A skew of 0 means that the distribution is symmetrical - the bell curve is not lopsided or biased to one side. Here's what a normal distribution looks like:
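As a quick sketch (synthetic data, not our residuals), we can verify that a large sample drawn from a normal distribution has a mean and skew very close to zero:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
normal_sample = pd.Series(rng.normal(loc=0, scale=1, size=100_000))

# Both should be very close to 0 for a normal distribution
print(f"mean: {normal_sample.mean():.3f}, skew: {normal_sample.skew():.3f}")
```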

Challenge

  • Calculate the mean and the skewness of the residuals.
  • Again, use Seaborn's .displot() to create a histogram and superimpose the Kernel Density Estimate (KDE)
  • Is the skewness different from zero? If so, by how much?
  • Is the mean different from zero?
In [47]:
print(f"The mean of our residuals is {residuals.mean():.20f}")
display(residuals.mean())
resid_mean = round(residuals.mean(), 2)
The mean of our residuals is -0.00000000000000367583
-3.675827519154974e-15

Damn! That's as close to zero as you can get!
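This is not luck: when the regression includes an intercept, ordinary least squares forces the training residuals to average to zero (up to floating-point error). A tiny check on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=50)

model = LinearRegression().fit(X_toy, y_toy)
toy_residuals = y_toy - model.predict(X_toy)

# With an intercept, OLS training residuals always average to ~0
print(toy_residuals.mean())
```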

In [48]:
print(f"The skewness of our residuals is {residuals.skew():.4f}")
resid_skew = round(residuals.skew(), 2)
The skewness of our residuals is 1.4594

That would mean our residuals distribution is not really symmetrical. Let's see:

In [49]:
plt.figure(dpi=120)
with sns.axes_style('ticks'):
    dis = sns.displot(x=residuals, bins=50,
                      kde=True, height=6,
                      color='indigo')
    dis.set(xlim=(-25, 25),
            title=f'Residuals Skew ({resid_skew}) Mean ({resid_mean})')
plt.show()
<Figure size 768x576 with 0 Axes>

We see that the residuals have a skewness of 1.46. There could be some room for improvement here.

Data Transformations for a Better Fit¶

We have two options at this point:

  1. Change our model entirely. Perhaps a linear model is not appropriate.
  2. Transform our data to make it fit better with our linear model.

Let's try a data transformation approach.

Challenge

Investigate if the target data['PRICE'] could be a suitable candidate for a log transformation.

  • Use Seaborn's .displot() to show a histogram and KDE of the price data.
  • Calculate the skew of that distribution.
  • Use NumPy's log() function to create a Series that has the log prices
  • Plot the log prices using Seaborn's .displot() and calculate the skew.
  • Which distribution has a skew that's closer to zero?
In [50]:
data.PRICE.mean()
Out[50]:
22.532806324110677

We can set the xlim to about 2 × mean ≈ 45 to better see whether the distribution is symmetrical.

In [51]:
print(f"The skewness of the target PRICE is: "\
      f"{data.PRICE.skew():.3f}")
tgt_skew = data['PRICE'].skew()
The skewness of the target PRICE is: 1.108
In [52]:
plt.figure(dpi=120)
with sns.axes_style('darkgrid'):
    dis = sns.displot(x=data.PRICE, bins=50,
                      kde=True, height=5.5,
                      color='green')
    dis.set(title=f'Data PRICE Distribution. Skew is {tgt_skew:.3}',
            xlim=(0, 45))
plt.show()
<Figure size 768x576 with 0 Axes>
In [53]:
log_prices = np.log(data.PRICE)
log_prices
Out[53]:
0     3.18
1     3.07
2     3.55
3     3.51
4     3.59
      ... 
501   3.11
502   3.03
503   3.17
504   3.09
505   2.48
Name: PRICE, Length: 506, dtype: float64
In [54]:
log_skew = log_prices.skew()
log_skew
Out[54]:
-0.33032129530987864
In [55]:
log_prices.mean()
Out[55]:
3.0345128744146552
In [56]:
plt.figure(dpi=120)
with sns.axes_style('darkgrid'):
    dis = sns.displot(x=log_prices, bins=50,
                      kde=True, height=5.5,
                      color='green')
    dis.set(title=f'Log() Data PRICE Distribution. Skew is {log_skew:.3}',
            # xlim=(0, 6)
            xlim=(1.5,4.5)
           )
plt.show()
<Figure size 768x576 with 0 Axes>

The target (log(PRICE)) is actually more symmetrical now.

The log prices have a skew that's closer to zero. This makes them a good candidate for use in our linear model. Perhaps using log prices will improve our regression's r-squared and our model's residuals.

How does the log transformation work?¶

Using a log transformation does not affect every price equally: large prices are compressed more than smaller prices in the dataset.

We can see this when we plot the actual prices against the (transformed) log prices.

In [57]:
plt.figure(dpi=150)
plt.scatter(data.PRICE, np.log(data.PRICE))

plt.title('Mapping the Original Price to a Log Price')
plt.ylabel('Log Price')
plt.xlabel('Actual Price in $ thousands')
plt.show()
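Numerically, the log maps equal price ratios onto equal gaps, which is why the largest prices get squeezed the most (illustrative numbers in $ thousands):

```python
import numpy as np

prices = np.array([5.0, 10.0, 25.0, 50.0])  # $ thousands
log_prices_demo = np.log(prices)

# Doubling from 25 to 50 (a $25k jump) covers the same log-distance
# as doubling from 5 to 10 (only a $5k jump)
gap_low = log_prices_demo[1] - log_prices_demo[0]   # log(10) - log(5)
gap_high = log_prices_demo[3] - log_prices_demo[2]  # log(50) - log(25)
print(gap_low, gap_high)  # both equal log(2), about 0.693
```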

Regression using Log Prices¶

Using log prices instead, our model has changed to:

$$ \log(\widehat{PRICE}) = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \ldots + \theta_{13} LSTAT $$

Challenge:

  • Use train_test_split() with the same random state as before to make the results comparable.
  • Run a second regression, but this time use the transformed target data.
  • What is the r-squared of the regression on the training data?
  • Have we improved the fit of our model compared to before based on this measure?
In [58]:
X_train, X_test, log_y_train, log_y_test =\
  train_test_split(X, log_prices, train_size=0.8, random_state=10)
In [59]:
regression_2 = LinearRegression()
regression_2.fit(X_train, log_y_train)
regression_2.intercept_
Out[59]:
4.0599438717751966
In [60]:
coeffs_2 = pd.DataFrame(regression_2.coef_, index=X_train.columns,
                        columns=['Coef']).T
coeffs_2
Out[60]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
Coef -0.01 0.00 0.00 0.08 -0.70 0.07 0.00 -0.05 0.01 -0.00 -0.03 0.00 -0.03
In [61]:
rsquared_2 = regression_2.score(X_train, log_y_train)
print(f'New training data r-squared: {rsquared_2:.2}')
New training data r-squared: 0.79

Even better than before! From 0.75 to 0.79!

In [62]:
round(regression_2.score(X_test, log_y_test),2)
Out[62]:
0.74

And with the test data, also better than before! From 0.67 to 0.74!

Evaluating Coefficients with Log Prices¶

Challenge: Print out the coefficients of the new regression model.

  • Do the coefficients still have the expected sign?
  • Is being next to the river a positive based on the data?
  • How does the quality of the schools affect property prices? What happens to prices as there are more students per teacher?

Hint: Use a DataFrame to make the output look pretty.

In [63]:
coeffs_2
Out[63]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
Coef -0.01 0.00 0.00 0.08 -0.70 0.07 0.00 -0.05 0.01 -0.00 -0.03 0.00 -0.03

The coefficients still make sense. More crime (CRIM) negative, more rooms (RM) positive, more pollution (NOX) highly negative, poverty (LSTAT) negative.

Being next to the river impacts the PRICE positively too.

In [64]:
with pd.option_context('display.min_rows', 4):
    display(data.PTRATIO)
0     15.30
1     17.80
       ... 
504   21.00
505   21.00
Name: PTRATIO, Length: 506, dtype: float64

We have a negative coef for PTRATIO, meaning the more students per teacher, the lower the house prices.

Course:

So how can we interpret the coefficients? The key thing we look for is still the sign: CHAS has a coefficient greater than zero, so being close to the river is associated with higher property prices.

More students per teacher (a higher PTRATIO) is a clear negative. Smaller class sizes are indicative of higher quality education, so PTRATIO has a negative coefficient.
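
Because the target is the log price, each coefficient also has a handy percentage interpretation: a one-unit increase in a feature multiplies the dollar price by e raised to that coefficient. A minimal sketch (the PTRATIO coefficient below is taken from the table above, rounded):

```python
import numpy as np

# With a log-transformed target, a coefficient theta_k means a one-unit
# increase in feature k multiplies the price by exp(theta_k).
# PTRATIO coefficient from the regression above (rounded):
ptratio_coef = -0.035

# Approximate % change in price for one extra student per teacher
pct_change = (np.exp(ptratio_coef) - 1) * 100
print(f'{pct_change:.1f}%')  # roughly -3.4%
```

So each additional student per teacher is associated with roughly a 3.4% lower price, holding the other features constant.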

Regression with Log Prices & Residual Plots¶

Challenge:

  • Copy-paste the cell where you've created scatter plots of the actual versus the predicted home prices as well as the residuals versus the predicted values.
  • Add 2 more plots to the cell so that you can compare the regression outcomes with the log prices side by side.
  • Use indigo as the colour for the original regression and navy for the color using log prices.
In [65]:
predicted_log_values = regression_2.predict(X_train)
log_residuals = log_y_train - predicted_log_values
In [66]:
plt.figure(figsize=(6,4), dpi=120)
with sns.axes_style('ticks'):
    g = sns.scatterplot(x=y_train, y=predicted_values,
                        c='indigo', alpha=0.6)
    sns.lineplot(x=y_train, y=y_train, c='cyan')
    g.set(title=f'Actual vs Predicted Prices: $y _i$ vs $\hat y_i$'\
                f' (R-Squared {rsquare:.2f})',
          xlabel=f'Actual prices in \$ thousands $y _i$',
          ylabel=f'Predicted prices in \$ thousands $\hat y _i$')
plt.show()

plt.figure(figsize=(6,4), dpi=120)
with sns.axes_style('ticks'):
    g = sns.scatterplot(x=log_y_train, y=predicted_log_values,
                        c='navy', alpha=0.6)
    sns.lineplot(x=log_y_train, y=log_y_train, c='cyan')
    g.set(title=f'Actual vs Predicted Log Prices: $y _i$ vs $\hat y_i$'\
                f' (R-Squared {rsquared_2:.2f})',
          xlabel=f'Actual Log Prices $y _i$',
          ylabel=f'Predicted Log Prices $\hat y _i$')
plt.show()

plt.figure(figsize=(6,4), dpi=120)
with sns.axes_style('ticks'):
    g = sns.scatterplot(x=predicted_values, y=residuals,
                        c='indigo', alpha=0.6)
    g.set(title=f'Residuals vs Predicted Values',
          xlabel='Predicted prices in \$ thousands $\hat y _i$',
          ylabel='Residuals')
plt.show()

plt.figure(figsize=(6,4), dpi=120)
with sns.axes_style('ticks'):
    g = sns.scatterplot(x=predicted_log_values, y=log_residuals,
                        c='navy', alpha=0.6)
    g.set(title=f'Residuals vs Predicted Values for Log Prices',
          xlabel='Predicted Log Prices $\hat y _i$',
          ylabel='Residuals')
plt.show()

It's hard to see a difference here just by eye. The predicted values seem slightly closer to the cyan line, but eyeballing the charts is not terribly helpful in this case.

Challenge:

Calculate the mean and the skew for the residuals using log prices. Are the mean and skew closer to 0 for the regression using log prices?

In [67]:
print(log_residuals.mean(), '\n')
print(f"{log_residuals.mean():.20f}")
resid_mean_2 = round(log_residuals.mean(),2)
8.381634220561207e-16 

0.00000000000000083816

The mean is even closer to zero than before, but both are effectively zero anyway.

In [68]:
print(log_residuals.skew())
resid_skew_2 = round(log_residuals.skew(), 2)
0.09299942594123323

The skew is now at 0.093 compared to 1.46 before, so definitely an improvement.

In [69]:
# displot() is a figure-level function that creates its own figure,
# so calling plt.figure() first would just leave an empty figure behind
with sns.axes_style('ticks'):
    dis = sns.displot(x=residuals, bins=50,
                      kde=True, height=6,
                      color='indigo')
    dis.set(xlim=(-25, 25),
            title=f'Previous Model: Residuals Skew ({resid_skew}) Mean ({resid_mean})')
plt.show()

with sns.axes_style('ticks'):
    dis = sns.displot(x=log_residuals, bins=50,
                      kde=True, height=6,
                      color='navy')
    dis.set(title=f'Log Price Model: Residuals Skew ({resid_skew_2}) Mean ({resid_mean_2})')
plt.show()

Yep, we can definitely see that the distribution of the residuals is more symmetrical with the log model.

Course: Our new regression residuals have a skew of 0.09 compared to a skew of 1.46. The mean is still around 0. From both a residuals perspective and an r-squared perspective we have improved our model with the data transformation.
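
As a quick intuition check on what these skew numbers mean (a hypothetical example using simulated data, not the Boston set): a symmetric sample has skew near 0, while a log-normal sample is strongly right-skewed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
symmetric = pd.Series(rng.normal(size=10_000))              # normal: symmetric
right_skewed = pd.Series(np.exp(rng.normal(size=10_000)))   # log-normal: long right tail

print(round(symmetric.skew(), 2))     # close to 0
print(round(right_skewed.skew(), 2))  # strongly positive
```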

Compare Out of Sample Performance¶

The real test is how our model performs on data that it has not "seen" yet. This is where our X_test comes in.

Challenge

Compare the r-squared of the two models on the test dataset. Which model does better? Is the r-squared higher or lower than for the training dataset? Why?

In [70]:
print(regression.score(X_test, y_test))
print(regression_2.score(X_test, log_y_test))
0.6709339839115642
0.744692230626074

The log model does better, with 0.74 compared to 0.67.

However, they both do worse on the test data than on the training data, which is normal: this is data the model has never seen before.

Course: By definition, the model has not been optimised for the testing data. Therefore performance will be worse than on the training data. However, our r-squared still remains high, so we have built a useful model.

Predict a Property's Value using the Regression Coefficients¶

Our preferred model now has an equation that looks like this:

$$ \log (\widehat{PRICE}) = \theta_0 + \theta_1 RM + \theta_2 NOX + \theta_3 DIS + \theta_4 CHAS + \dots + \theta_{13} LSTAT $$

The average property has the mean value for all its characteristics:

In [71]:
# Starting Point: Average Values in the Dataset
features = data.drop(['PRICE'], axis=1)
average_vals = features.mean().values
property_stats = pd.DataFrame(data=average_vals.reshape(1, len(features.columns)), 
                              columns=features.columns)
property_stats
Out[71]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 3.61 11.36 11.14 0.07 0.55 6.28 68.57 3.80 9.55 408.24 18.46 356.67 12.65

Challenge

Predict how much the average property is worth using the stats above. What is the log price estimate and what is the dollar estimate? You'll have to reverse the log transformation with np.exp() to find the dollar value.
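
As a reminder, np.exp() is the inverse of np.log(), so a predicted log price maps straight back to a price in $ thousands. For example:

```python
import numpy as np

# np.exp() undoes np.log(): a log price of 3.03 corresponds to
# roughly exp(3.03) ≈ 20.7 (in $ thousands)
log_price = 3.03
print(round(np.exp(log_price), 1))  # 20.7

# Round-trip check: exp(log(x)) recovers x
assert np.isclose(np.exp(np.log(20.7)), 20.7)
```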

In [72]:
regression_2.intercept_
Out[72]:
4.0599438717751966
In [73]:
regression_2.coef_
Out[73]:
array([-1.06717261e-02,  1.57929102e-03,  2.02989827e-03,  8.03305301e-02,
       -7.04068057e-01,  7.34044072e-02,  7.63301755e-04, -4.76332789e-02,
        1.45651350e-02, -6.44998303e-04, -3.47947628e-02,  5.15896157e-04,
       -3.13900565e-02])
In [74]:
regression_2.coef_.ndim
Out[74]:
1
In [75]:
property_stats.values
Out[75]:
array([[3.61352356e+00, 1.13636364e+01, 1.11367787e+01, 6.91699605e-02,
        5.54695059e-01, 6.28463439e+00, 6.85749012e+01, 3.79504269e+00,
        9.54940711e+00, 4.08237154e+02, 1.84555336e+01, 3.56674032e+02,
        1.26530632e+01]])

Testing how NumPy multiplies arrays element-wise

In [76]:
np.array([1,2,3]) * np.array([1,4,1])
Out[76]:
array([1, 8, 3])
In [77]:
np.array([1,2,3]) * np.array([1,4,1]).T
Out[77]:
array([1, 8, 3])

Using np.dot() to compute the "dot product of two arrays".

As the NumPy docs put it: if both a and b are 1-D arrays, it is the inner product of the vectors.

In [78]:
np.dot(np.array([1,2,3]), np.array([1,1,1]))
Out[78]:
6

YES!

In [79]:
predic_log_price_avg = float(regression_2.intercept_ \
    + np.dot(property_stats.values, regression_2.coef_))

print(predic_log_price_avg)
3.030287230563395
In [80]:
predic_price_avg = np.exp(predic_log_price_avg)
print(f"The predicted price for the average property is:"\
      f" ${predic_price_avg*1000:,.0f}")
print(f"The average house price from our data is ${data.PRICE.mean()*1000:,.0f}")
The predicted price for the average property is: $20,703
The average house price from our data is $22,533

Course Solution: I don't need the .dot() function to multiply arrays, I can just use the .predict() function from our LinearRegression object.

In [81]:
display(regression_2.predict(property_stats))
display(np.exp(regression_2.predict(property_stats)))
array([3.03028723])
array([20.70317832])

Course: A property with an average value for all the features has a value of $20,700.

Challenge

Keeping the average values for CRIM, RAD, INDUS and others, value a property with the following characteristics:

In [82]:
# Define Property Characteristics
next_to_river = True
nr_rooms = 8
students_per_classroom = 20 
distance_to_town = 5
pollution = data.NOX.quantile(q=0.75) # high
amount_of_poverty =  data.LSTAT.quantile(q=0.25) # low
In [83]:
# Solution:
property_stats_2 = property_stats.copy()  # copy, so we don't mutate the original
if next_to_river:
    property_stats_2.CHAS = 1
else:
    property_stats_2.CHAS = 0
property_stats_2.RM = nr_rooms
property_stats_2.PTRATIO = students_per_classroom
property_stats_2.DIS = distance_to_town
property_stats_2.NOX = pollution
property_stats_2.LSTAT = amount_of_poverty
property_stats_2
Out[83]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
0 3.61 11.36 11.14 1 0.62 8 68.57 5 9.55 408.24 20 356.67 6.95
In [84]:
predic_price_spec = np.exp(regression_2.predict(property_stats_2)[0]) * 1000
print(f"The predicted price for a property with these specific "\
      f"features is ${predic_price_spec:,.0f}")
The predicted price for a property with these specific features is $25,792

Learning Points¶

Today we've learned:

  • How to quickly spot relationships in a dataset using Seaborn's .pairplot(). The graph generated can be quite big: if the DataFrame has 5 columns/parameters, each pair of columns is plotted against each other, so we get a grid of 25 plots. There are lots of options for this plot, see the documentation. By default, histogram distributions appear on the diagonal and scatter plots for the pairings. But we can also have a regplot instead (with param kind='reg'), which is essentially a scatter with a regression line. We can pass arguments to each different type of plot, see the example.

  • How to look at some of the relationships/pairings in more detail with Seaborn's .jointplot(). Here too we can highly customise the plot with the kind= param and by passing parameters such as:

joint_kws={'scatter_kws': {'edgecolor': 'white', 'alpha': 0.5},
           'line_kws': {'color': 'orange'},
           'lowess': True}
  • How to split the data into a training and testing dataset to better evaluate a model's performance. Extremely useful feature. And quite easy to execute too.
from sklearn.model_selection import train_test_split
X = data.drop('PRICE', axis='columns')
y = data.PRICE
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=10)
  • How to run a multivariable regression.
regression = LinearRegression()
regression.fit(X_train, y_train)

regression.intercept_
regression.coef_
rsquare = regression.score(X_train, y_train)
  • How to evaluate that regression based on the sign of its coefficients:
coeffs = pd.DataFrame(regression.coef_, index=data.iloc[:, :13].columns, columns=['Coef']).T
  • How to compute the predictions (.predict()) and the residuals, then analyse and look for patterns in a model's residuals, especially looking at the mean and the skewness.
predicted_values = regression.predict(X_train)
residuals = (y_train - predicted_values)

# Plot actual vs predicted values
sns.scatterplot(x=y_train, y=predicted_values)
sns.lineplot(x=y_train, y=y_train)

# Plot residuals vs predicted
sns.scatterplot(x=predicted_values, y=residuals)

# Mean and skew
resid_mean = round(residuals.mean(), 2)
resid_skew = round(residuals.skew(), 2)

# Checking distribution symmetry
sns.displot(x=residuals, bins=50, kde=True)
  • How to improve a regression model using (a log) data transformation.
# Checking target distr. symmetry
tgt_skew = data.PRICE.skew()
sns.displot(x=data.PRICE, bins=50, kde=True)

# Checking again after data transform (e.g. log)
log_skew = np.log(data.PRICE).skew()
sns.displot(x=np.log(data.PRICE), bins=50, kde=True)
  • How to specify your own values for various features and use your model to make a prediction: .predict()

  • That it can be interesting to display some key values in the plot title (e.g. avg/mean, r-squared, skew)

  • How to change the display options of NumPy with .set_printoptions(), especially for the display of arrays: threshold and edgeitems. We can also change the display options temporarily like so:

with np.printoptions(threshold=500, edgeitems=10):
    display(yes_no_river)
  • How to display a bar graph with just the .values of a Series on the y-axis and create our own list of labels on the x-axis, like so fig = px.bar(x=['No', 'Yes'], y=river_access.values)

Today we've reviewed:

  • How to get the column names as a list: data.columns

  • How to plot a distribution with Seaborn's .displot() and how to add a density distribution (KDE) to that histogram with the param kde=True, and controlling the nb of bins with bins=, and changing the bins' border color with ec=

  • How to use the height= and aspect= parameters on some Seaborn figures, when I cannot resize with plt.figure(figsize=(width,height))